October 2017

:~$ whoami

Matthias Bannert

  • current occupation: data scientist / software developer @ETH Zurich
  • occasional consultant
  • studied economics @UniKN, PhD @ETHZ: partly economics, mostly methodology + stats
  • CTO of Swiss startup fanpictor from 2012 to 2014
  • open source software projects: timeseriesdb, tstools, dropR, RAdwords

About this course

Approach

# listen - forget
# see - remember
# do - understand

Goals

  • plan
  • apply
  • scale

Overview

  • Day 1: Organize
    • Introduction
    • Data Generating Processes
    • Types of Data
    • Manage and Archive
  • Day 2: Process and Communicate
    • Visualization
    • Methodology

Background Poll

Inspiration: Illustrate

mobile evolution

Inspiration: Relation

million lines

Inspiration: Choropleth

five percent

Inspiration: Draw R

Inspiration: Process Data

  • download automatically
  • read spreadsheet
  • process
  • visualize

Inspiration: Dynamic Reporting / Presentations

  • create report
  • dynamic figures & tables
  • html, pdf, docx

Data Analytics Toolbox

Source: all logos taken from their respective companies' websites.

Getting Started

"Premature optimization is the root of all evil."
Donald Knuth

But …

The R Language for Statistical Computing

  • First appeared in 1993
  • designed by Ihaka and Gentleman
  • Last Stable Release: 3.4.1

Why R?

  • interpreted language
  • interfaces to many compiled languages
  • easy to learn
  • open source, license cost free
  • backed by Microsoft
  • one-of-a-kind ecosystem, wide range of packages

The R Ecosystem

The R Studio IDE

  • Switch to LTR Layout
  • Console vs. Scripting window
  • comments
  • shortcut cmd+enter: run selection
  • shortcut ctrl+1, ctrl+2: switch windows
  • shortcut ctrl+L: clear console screen
  • shortcut cmd+D: multiple cursors @instances
  • file explorer
  • plot window
  • .Rproj

Basic R Objects

  • vector
  • matrix
  • data.frame
  • list
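
A minimal sketch of the four object types listed above. All names and values are made up for illustration.

```r
# A vector holds elements of a single type.
v <- c(1.5, 2.0, 3.5)

# A matrix is a vector with two dimensions (filled column by column).
m <- matrix(1:6, nrow = 2, ncol = 3)

# A data.frame is a table whose columns may have different types.
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# A list can hold anything, including other lists or data.frames.
l <- list(numbers = v, table = df)

class(v)
dim(m)
str(df)
names(l)
```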

Brackets and braces

  • [row,col]: Index
  • {}: function or loop body
  • (): function parameters
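
The three kinds of brackets side by side, as a small sketch. The function name square is a hypothetical helper.

```r
m <- matrix(1:6, nrow = 2)   # () pass arguments to a function

m[1, 2]    # [] index: row 1, column 2 -> 3
m[, 3]     # an empty position selects everything: column 3 -> 5 6

# {} delimit the body of a function ...
square <- function(x) {
  x^2
}

# ... or the body of a loop.
total <- 0
for (i in 1:3) {
  total <- total + square(i)
}
total      # 1 + 4 + 9 = 14
```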

Basic functions I

  • ls()
  • rm()
  • c()
  • matrix()
  • data.frame()
  • list()

getting help: ?function_name, e.g. ?matrix

Basic functions II

  • head()
  • tail()
  • str()
  • function()
  • lapply()
  • data()

getting help: ?function_name, e.g. ?lapply
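
A short sketch that uses each of the functions above on the built-in mtcars dataset. The helper spread is hypothetical and only serves to show function() and lapply().

```r
data(mtcars)           # load a built-in dataset into the workspace
head(mtcars, 3)        # first three rows
tail(mtcars, 3)        # last three rows
str(mtcars)            # compact overview of the structure

# function() defines a function; this one returns max minus min.
spread <- function(x) {
  max(x) - min(x)
}

# lapply() applies a function to every element of a list.
# A data.frame is a list of columns, so this works column by column.
spreads <- lapply(mtcars[, c("mpg", "hp")], spread)
spreads$mpg
spreads$hp
```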

Before you start …

Good habits: Snakes …

  • i_am_a_snake

and camels

  • jeSuisUnCamel

Task I: Working on a built-in dataset

  1. How many observations does the dataset mtcars have?
  2. What are the mean and the median miles per gallon?
  3. Which is the most ecological car?
  4. Which is the most ecological car by cylinders?
  5. How is mpg distributed?
  6. Why does solving analytics exercises through programming make sense?
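
One possible way to approach questions 1 to 5, as a sketch rather than the reference solution. It assumes that "most ecological" means the highest mpg value.

```r
data(mtcars)

nrow(mtcars)                 # 1. number of observations
mean(mtcars$mpg)             # 2. mean miles per gallon
median(mtcars$mpg)           #    and the median

# 3. car with the highest mpg (row names hold the car models)
best_overall <- rownames(mtcars)[which.max(mtcars$mpg)]
best_overall

# 4. same question within each cylinder group
best_by_cyl <- sapply(split(mtcars, mtcars$cyl),
                      function(d) rownames(d)[which.max(d$mpg)])
best_by_cyl

# 5. distribution of mpg: numeric summary and histogram
summary(mtcars$mpg)
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")
```

Because every step is a line of code, the same script can be rerun on a new dataset without repeating manual work.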

Summary I

  • a scripting language is a good start
  • understanding a language helps to remember its syntax
  • many tasks can be solved without a database or a larger stack
  • programming makes tasks scalable and reproducible

How About Real Data?

Data Generating Processes: Simulation
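
A minimal sketch of simulated data, assuming a simple linear process with made-up parameters (intercept 1, slope 0.5). Since the true process is known, estimates can be checked against it.

```r
set.seed(42)                       # make the random draws reproducible

n <- 100
x <- rnorm(n, mean = 10, sd = 2)   # simulated explanatory variable
e <- rnorm(n, mean = 0, sd = 1)    # random noise
y <- 1 + 0.5 * x + e               # the known data generating process

# Estimate the relationship and compare with the true values 1 and 0.5.
fit <- lm(y ~ x)
coef(fit)
```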

Data Generating Processes: Logging

  • sources: web servers, IoT devices
  • event based files
  • not aggregated, large amounts of data

solutions:

  • specific tools: awstats
  • SaaS products
  • programming

source: tagesanzeiger.ch
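
The "programming" option can be sketched in a few lines: parse event-based log lines and aggregate them. The log format and the entries below are hypothetical; real server logs differ.

```r
# Hypothetical log lines: date, time, HTTP status, requested path.
log_lines <- c(
  "2017-10-02 09:15:01 200 /index.html",
  "2017-10-02 09:47:13 404 /missing.html",
  "2017-10-02 10:02:45 200 /index.html",
  "2017-10-02 10:30:00 200 /about.html"
)

# Split each line on spaces and bind the pieces into a data.frame.
parts <- do.call(rbind, strsplit(log_lines, " "))
log_df <- data.frame(date = parts[, 1],
                     time = parts[, 2],
                     status = as.integer(parts[, 3]),
                     path = parts[, 4],
                     stringsAsFactors = FALSE)

# Aggregate the raw events: number of requests per hour.
log_df$hour <- substr(log_df$time, 1, 2)
requests_per_hour <- table(log_df$hour)
requests_per_hour
```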

Data Generating Processes: tracking

  • user specific logging
  • data similar to log data
  • e-commerce tracking
  • often received through APIs of software-as-a-service (SaaS) products
    • Google Analytics
    • Google AdWords
    • Adsense

source: http://gantalcala.org/

Data Generating Processes: surveys

  • input: online forms, smartphone apps, interviews, paper
  • result in cross sectional or panel data

source: pinterest

Data Generating Processes: download, (web) scraping

  • web scraping
  • scraping spreadsheets, html tables
  • image processing
  • prices from car vendor
  • text mining

source: youtube, spaceballs

A Word on APIs

  • REST popular on the web
  • SOAP a bit more old fashioned, but used in B2B often
  • APIs are an easy choice for receiving third-party data
    • can be processed automatically
    • may conflict with security policies

Task II - Discussion: Security

Discuss in smaller groups:

  • Does using scripts to access the web conflict with your department's policies?
  • What would be a good option to receive 3rd party data?
  • How could 3rd party data be merged with internal data?
  • Where do you stand on cloud computing?

Types of datasets: time series

ts2 <- ts(rnorm(20),
          start = c(1995,1),
          frequency = 4)
ts2
##            Qtr1       Qtr2       Qtr3       Qtr4
## 1995 -0.3625870 -1.3589722  2.9884704 -0.3030327
## 1996  1.3504988  0.2517712 -1.3133845 -0.5131971
## 1997 -0.4809161 -0.9930176  0.2278636  0.2560794
## 1998 -0.4265592 -0.2902753 -1.8648031 -1.4235040
## 1999 -0.3613658 -0.4917637  1.4510953 -1.1790192

examples: monthly revenues over time, stocks, aggregated log files
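
A sketch for the monthly revenue example, with made-up figures. It shows two common operations on ts objects: cutting out a period and changing the frequency.

```r
# Hypothetical monthly revenues for two years, starting January 2015.
revenue <- ts(c(10, 12, 11, 13, 15, 14, 16, 18, 17, 19, 21, 20,
                20, 22, 21, 23, 25, 24, 26, 28, 27, 29, 31, 30),
              start = c(2015, 1),
              frequency = 12)

# window() cuts out a period; here the first quarter of 2016.
q1_2016 <- window(revenue, start = c(2016, 1), end = c(2016, 3))
q1_2016

# aggregate() changes the frequency; here monthly values to yearly totals.
yearly <- aggregate(revenue, nfrequency = 1, FUN = sum)
yearly
```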

Types of datasets: cross sectional data

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
  • multiple variables
  • one period

Types of datasets: panel data

  • multiple variables
  • longitudinal data
  • e.g. German Socio-Economic Panel (GSOEP)
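
A minimal sketch of a panel in long format, assuming one row per unit and period. The persons and values are made up.

```r
# Hypothetical panel: three persons observed in two years.
panel <- data.frame(
  person_id = rep(1:3, each = 2),
  year      = rep(c(2015, 2016), times = 3),
  income    = c(50, 52, 40, 43, 61, 60),
  employed  = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
)
panel

# A single period of the panel is a cross section ...
cross_section <- panel[panel$year == 2016, ]
cross_section

# ... and one unit followed over time is a (short) time series.
person_2 <- panel[panel$person_id == 2, c("year", "income")]
person_2
```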

Nested data structures

l <- list()
l$element1 <- 2
l$element2 <- head(mtcars,4)
l
## $element1
## [1] 2
## 
## $element2
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

examples: meta information, sector classification (hierarchical), GDP components, translations, attributes, properties
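
A sketch of how meta information with translations might be nested. The key and the labels are hypothetical.

```r
# Hypothetical meta information for one time series.
meta <- list(
  key = "ch.gdp.total",
  labels = list(en = "Gross domestic product",
                de = "Bruttoinlandprodukt"),
  components = list("consumption", "investment", "net exports")
)

meta$labels$de             # $ walks down the hierarchy by name
meta[["labels"]][["en"]]   # [[ ]] does the same and also accepts variables
length(meta$components)    # number of elements on the second level
str(meta, max.level = 1)   # overview of the top level only
```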

Task III: Dataset types

Represent each of the following dataset types using R:

  • time series
  • cross section
  • panel
  • a nested structure.

Please suggest an in-memory representation and a file based representation.
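
As a hint for the file based part, here is one possible round trip, not the only answer. It assumes CSV for flat tables and RDS for nested structures, and writes to temporary files.

```r
# In-memory representation: a plain data.frame (cross section).
cars <- head(mtcars, 5)

# File based, human readable: CSV. Row names end up in the first column.
csv_file <- tempfile(fileext = ".csv")
write.csv(cars, csv_file)
cars_csv <- read.csv(csv_file, row.names = 1)

# File based, R specific: RDS keeps types and nested structures intact.
rds_file <- tempfile(fileext = ".rds")
nested <- list(meta = list(source = "mtcars"), data = cars)
saveRDS(nested, rds_file)
nested_back <- readRDS(rds_file)

identical(rownames(cars), rownames(cars_csv))
identical(nested, nested_back)
```

CSV can be opened by almost any tool but only stores a flat table; RDS stores any R object but can only be read from R.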

A Short Note On Data Types in R

  • character
  • numeric
  • factor (beware of stringsAsFactors = TRUE)
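
A small sketch of why factors deserve the warning. The column name and values are made up; the argument is set explicitly so the example does not depend on the default of the installed R version.

```r
# With stringsAsFactors = TRUE, character columns become factors.
df <- data.frame(amount = c("10", "5", "20"), stringsAsFactors = TRUE)
class(df$amount)

# Pitfall: as.numeric() on a factor returns the internal level codes,
# not the numbers the labels look like.
codes <- as.numeric(df$amount)
codes

# Converting via character first recovers the values.
values <- as.numeric(as.character(df$amount))
values

# With stringsAsFactors = FALSE the column stays character.
df2 <- data.frame(amount = c("10", "5", "20"), stringsAsFactors = FALSE)
class(df2$amount)
```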